PERSONNEL RATING EFFECTIVENESS AS A FUNCTION
OF NUMBER OF RATING STATEMENTS


I.  INTRODUCTION 

The  literature  contains  a large  number  of  studies  issuing  from  the  search  for  appropriate 
rating  constructs  to  be  used  in  the  collection  of  rated  data.  The  results  have  been  rather 
disappointing,  but  still  the  search  goes  on.  The  pursuit  of  rating  constructs  (or  “factors”)  is 
probably  due  to  the  enormous  influence  of  Thurstone's  work  with  the  factor  analysis  of  test  data 
and  to  his  conclusion  that  complex  human  characteristics  can  best  be  explained  in  terms  of  a few 
orthogonal factors, that is, factors which are not correlated with each other. American
psychologists,  in  general,  have  accepted  Thurstone’s  position.  Those  who  have  worked  with  rating 
data  have  started  from  the  assumption  that  the  concept  of  orthogonality  is  almost  a natural  law. 
If  one  accepts  that  assumption,  it  is  reasonable  that  one  of  the  primary  goals  of  rating  research 
has  been  to  find  that  set  of  independent  (orthogonal)  constructs  which  best  describes  human 
behavior when rating data are used. It is, after all, merely an extension into rating data of a
principle  which  has  been  accepted  broadly  as  a fundamental  concept  in  test  data. 

There  are,  however,  at  least  four  major  difficulties  that  have  beset  researchers  in  their  quest 
for  simple  structure  in  rating  data.  These  difficulties  are  as  follows: 

1.  Orthogonality  as  a Concept.  Although  the  concept  of  orthogonality  as  a requisite  for 
factors  has  been  persuasive  to  American  psychologists,  not  all  prominent  modern  psychologists  have 
succumbed  to  the  attractiveness  of  Thurstone's  arguments  for  the  primacy  of  specific  or  orthogonal 
factors to describe human abilities (e.g., Horn, 1968; Humphreys, 1962; Jensen, 1966; McNemar,
1964, to name only a few). Indeed, McNemar (1964) has pointed out a serious weakness in the
entire  factor  analytic  process: 


In  practically  all  areas  of  psychological  research  the  demonstration  of  trivially 
small  minutiae  is  doomed  to  failure  because  of  random  errors.  Not  so  if  your 
technique  is  factor  analysis,  despite  its  being  based  on  the  correlation 
coefficient, that slipperiest of all statistical measures. By some magic, hypotheses
are  tested  without  significance  tests.  This  happy  situation  permits  me  to 
announce  a Principle  of  Psychological  Regress:  Use  statistical  techniques  that  lack 
inferential  power.  This  will  not  inhibit  your  power  of  subjective  inference. 

In  the  same  article  (a  discussion  of  the  concept  of  intelligence),  McNemar  finds  no  advantage 
of  fractionating  general  mental  ability  into  differentially  weighted  independent  separate  factors,  even 
in predicting meaningful criteria. The problem of finding separate rating “factors” is quite
analogous. We have no convincing evidence that separate rating statements will provide data that
are more useful than one global rating of all-around excellence. There may not be any set of
rating “factors” in the simple structure sense.

2. Theory Weakness. If rating “factors” exist, it is not at all clear in what direction they
may lie. There is no widely accepted theory which provides clues to the researcher to aid him in
his search. Without such clues, the number of descriptive qualities, interacting with ways of
expressing those qualities, is literally almost endless. This is one reason so much effort has been
expended in the search for the best rating statements. In test theory, it is known that certain
factors (e.g., verbal, numerical) are stable and replicable, although some have questioned the utility
of large factor sets. In rating theory, we do not even know the best format for collecting data,
much less which constructs are more likely to yield useful information.

It is not even clear how rating questions should be worded. For example, there has been
considerable controversy over whether statements oriented around tasks performed (“adjusts the
linkage on the clutch pedal”) or statements oriented around personal characteristics of the ratee
(“forceful and dominant in interpersonal relations”) are more useful in describing ratees (Kavanagh,
1971;  Massey,  Mullins,  & Earles,  1978).  Generally,  on  this  issue,  task-oriented  statements  appear  to 
be  slightly  better  by  internal  psychometric  standards  (slightly  less  inflated  means,  slightly  larger 
standard  deviations,  larger  reliability  coefficients),  but  no  differences  are  usually  observed  when 
evaluation  of  the  rating  statements  is  made  by  applying  an  external  criterion.  This  problem  of 
theory  weakness  is  much  more  severe  in  a search  of  rating  data  for  rating  factors  than  it  is  in  a 
search  of  test  score  data  for  intellectual  factors,  because  the  universe  of  discourse  is  so  much 
larger  and  harder  to  define. 

3.  Differential  Description.  Even  if  reasonably  good  factors  could  be  deduced  from  some 
theory,  and  even  if  they  really  were  present  in  a given  rating  situation,  there  is  no  assurance  that 
they could be demonstrated from the rating data collected. All psychologists are familiar with the
halo phenomenon in ratings, and the halo effect is possibly strong enough that the average rater
simply cannot produce differentiation among ratee characteristics sharp enough and objective
enough that the factors would show up in the data analysis. But the first burden of a set of
rating factors, if they are really worthwhile, must be to describe differentially the members of a
ratee  group.  If  a set  of  rating  statements  does  not  paint  a unique  picture  of  each  ratee  with 
recognizable  differences  between  his  picture  and  that  of  each  other  member,  it  is  difficult  to  see 
how  that  set  of  rating  statements  could  produce  useful  validities  against  any  reasonable  outside 
criterion. 

4.  Criterion  Problems.  Ratings  are  usually  collected  to  serve  as  a criterion,  rather  than 
predictor,  variable.  One  of  the  reasons  rating  data  are  used  is  that  the  investigator  can  find  no 
other  way  of  measuring  the  variable  of  interest.  Therefore,  the  rating  is  the  most  “ultimate”  score 
one  can  collect.  There  is  no  available  metric  closer  to  the  true  score  than  the  ratings  themselves. 
If  one  accepts  the  position  that  the  rating  score  is  the  ultimate  criterion,  then  of  course  one 
cannot  question  its  validity.  It  is  by  definition  perfectly  valid.  In  such  a case,  one  can  only 
investigate  certain  internal  psychometric  characteristics,  such  as  its  reliability  (Remmers,  1934,  p. 
621).  In  some  situations,  particularly  in  the  operational  use  of  ratings,  this  can  sometimes  be  a 
reasonable  position. 

In  doing  research  on  rating  methodology,  however,  it  seems  essential  to  have  some  other 
criterion  available  which  allows  one  to  compare  the  “goodness”  of  one  rating  or  set  of  ratings 
against  another.  It  is  of  little  use  to  compare  reliabilities,  means,  standard  deviations,  and  other 
internal  psychometric  characteristics  if  one  is  trying  to  determine  which  set  of  ratings  is  better  at 
measuring  a particular  condition. 

Previous  studies  in  this  series  of  investigations  (Curton,  Ratliff,  & Mullins,  1979;  Massey, 
Mullins,  & Earles,  1978)  have  attempted  to  discover  qualities  of  “good”  rating  statements 
compared  with  “poor”  rating  statements.  The  methodology  has  been  what  one  might  expect  from 
the  difficulties  labeled  3 and  4,  above,  discussing  differential  description  and  criterion  problems, 
respectively.  Different  sets  of  rating  statements  were  compared  by  observing  their  relative  merits  in 
differentially  describing  ratees,  and  in  their  prediction  of  external  criteria.  None  of  the  sets  of 
rating  statements  investigated  so  far  have  shown  any  superiority  over  any  of  the  others. 

Judging from the results available so far in the literature, it may well be that all
that the average rater can do effectively is rate on some general overall idea of excellence and
that requiring the rater to rate separate characteristics independently is beyond a person’s
capability.  If  this  is  so,  it  is  another  way  of  saying  that  halo  error  overwhelms  the  variance  in 
sets  of  rating  statements  and  that  sets  of  rating  statements  are  not  more  efficient  than  a single 
rating.  This  is  a study  of  the  relative  effectiveness  of  requiring  raters  to  rate  varying  numbers  of 
statements, with effectiveness defined by criteria external to the ratings.

II. METHOD


Rating  Rationale 

The  two  studies  cited  previously  were  based  on  the  premise  that  if  rating  statements  are 
actually  meaningful,  raters  should  be  able  to  identify  unlabelled  profiles  made  by  their  peers  from 
those  rating  statements.  The  two  previous  studies  indicated  that  correct  identifications  (hits)  are 
too  few  to  provide  a reasonable  degree  of  sensitivity.  The  average  number  of  hits  per  rater,  rating 
14  peers,  is  only  about  2.5.  Therefore,  a refined  method  of  hits  was  also  used,  called  the  rank 
order  (RO)  method. 

The  RO  method  provides  a way  to  credit  near  misses.  The  hits  approach  is  all  or  nothing. 
The  rating  subject  either  guesses  the  profile  correctly  or  does  not.  The  rater  may  be  sure  that  the 
profile being studied is either, say, peer B or peer F, and so commits to B. If the profile really
belongs to peer F, the rater not only gets no credit for being close on the identification of peer
F’s profile, but also misses peer B’s profile. The RO method, though a little difficult to
understand, is an approach designed to make the hits method more sensitive.

If,  in  addition  to  being  asked  for  absolute  hits,  the  rater  is  asked  to  rank  the  unidentified 
profiles  in  terms  of  some  standard  of  excellence  (e.g.,  how  well  the  people  with  these  profiles  will 
do  in  a course  of  instruction),  and  if  the  rater  is  also  given  a list  of  the  names  of  peers  and  is 
asked  to  rank  these  peers  on  the  same  standard  of  excellence,  and  then  the  rank  differences  are 
analyzed,  this  is  in  a real  sense  a measure  of  hits  which  gives  credit  for  near  misses.  For  example, 
if  the  rater  believes  correctly  that  the  three  profiles  which  appear  to  be  the  “best”  in  terms  of 
most  likely  to  succeed  belong  to  peers  B,  C,  and  F,  but  is  not  sure  which  is  which,  the  ranking 
approach  will  provide  credit  for  placing  these  peers  in  the  proper  end  of  the  ranking,  whereas 
absolute  hits  might  give  the  rater  no  credit  at  all. 
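
The contrast between the two scores can be sketched in a few lines of Python. This is only an illustration: the function names, the three-profile data, and the choice to sum the squared rank differences per rater are assumptions made here, not details given in the report.

```python
def count_hits(assigned, truth):
    """All-or-nothing score: profiles the rater matched to the correct peer."""
    return sum(1 for profile, peer in assigned.items() if truth[profile] == peer)


def squared_rank_differences(profile_ranks, peer_ranks, truth):
    """Rank order (RO) style score: for each profile, compare the rank given to
    the unlabeled profile with the rank given to the peer who really owns it.
    Smaller totals mean the two rankings agree more closely."""
    return sum((profile_ranks[p] - peer_ranks[truth[p]]) ** 2 for p in truth)


# Hypothetical rater: profiles P1 to P3 really belong to peers B, C, and F.
truth = {"P1": "B", "P2": "C", "P3": "F"}
assigned = {"P1": "C", "P2": "F", "P3": "B"}   # every individual guess is wrong
profile_ranks = {"P1": 1, "P2": 2, "P3": 3}    # ranking of the unlabeled profiles
peer_ranks = {"B": 2, "C": 3, "F": 1}          # ranking of the named peers

hits = count_hits(assigned, truth)
ro_score = squared_rank_differences(profile_ranks, peer_ranks, truth)
print(hits, ro_score)
```

Here the rater earns no hits, yet the rank-based score stays small because all three peers were placed at the same end of both rankings.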

Another  external  criterion  for  judging  the  relative  efficacy  of  various  sets  of  rating  statements 
can  be  some  typical  success  criterion,  such  as  the  final  grade  upon  graduation  from  school.  This 
criterion  is  not  quite  as  direct  as  peer  identification  for  judging  the  quality  of  peer  ratings, 
because  an  additional  element,  validity  of  chosen  statements,  becomes  a consideration.  Using  this 
criterion,  not  only  must  each  rating  statement  contribute  to  a sound  differential  description  of  the 
ratee,  but  also  it  must  be  a statement  which  happens  to  be  valid  for  that  success  criterion  in 
order  for  differences  among  rating  statements  to  appear. 

For  example,  one  might  calculate  the  correlation  coefficient  between  the  criterion  and  a set 
of  five  rating  statements  and  then  calculate  another  correlation  coefficient  between  the  criterion 
and  10  rating  statements,  of  which  five  were  the  same  statements  used  in  calculating  the  first 
correlation  coefficient.  By  applying  the  proper  statistics,  one  can  then  determine  whether  or  not 
the  set  of  10  statements  predicted  better  than  the  subset  of  five  statements.  Whether  they  do  is 
determined  not  only  by  the  quality  of  the  ratings  as  descriptors  of  the  ratees  but  also  by  the 
validity  of  the  quality  described  in  the  rating  statement  for  the  chosen  criterion.  One  should  be 
able  to  rate  one’s  peers  rather  accurately  on,  say,  height,  but  that  would  probably  not  be  a valid 
predictor  of  academic  ability.  Therefore,  it  is  believed  that  the  peer  identification  process  is 
probably  the  most  direct  method  of  judging  the  relative  accuracy  of  two  sets  of  rating  statements. 
Both  approaches  were  used  in  this  study. 
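
As a rough sketch of the comparison described above, the following Python correlates a success criterion with a smaller and a larger set of statements. The study itself used multiple regression; the unit-weighted sum used here is a deliberate simplification, and all names and numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def composite(ratings, statements):
    """Unit-weighted sum of the chosen statements for each ratee (an assumption;
    regression weights would be used in a full analysis)."""
    return [sum(row[s] for s in statements) for row in ratings]


# Hypothetical averaged ratings for six ratees on four statements.
ratings = [
    [3, 4, 3, 2],
    [4, 4, 5, 3],
    [2, 3, 2, 4],
    [5, 5, 4, 3],
    [3, 2, 3, 3],
    [4, 5, 4, 5],
]
grades = [310, 340, 300, 350, 305, 335]  # hypothetical final school grades

r_small = pearson(composite(ratings, [0, 1]), grades)        # first two statements
r_large = pearson(composite(ratings, [0, 1, 2, 3]), grades)  # all four statements
print(round(r_small, 3), round(r_large, 3))
```

Whether the larger set correlates more highly depends both on how well the added statements describe the ratees and on whether the qualities they describe are valid for the criterion.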

Subjects 

Nine  seminar  groups  of  Air  Force  non-commissioned  officers  (NCOs)  (technical  and  master 
sergeants)  assigned  to  the  Air  Training  Command  (ATC)  NCO  Academy  at  Lackland  AFB  Annex 
served  as  subjects  for  this  study.  Seven  of  the  seminar  groups  were  composed  of  15  subjects  each, 
one of 14 subjects, and one of 13 subjects, yielding a total N of 132. Length of military service
for  these  subjects  was  10  to  17  years. 


Procedure

The nine seminar groups were randomly assigned to three treatment conditions of three
seminar groups each. The subjects in Treatment Condition 1 were asked to rate their peers on five
rating statements. Those in Treatment Condition 2 were asked to rate their peers on 10 rating
statements, five of which were used by Treatment Condition 1. In Treatment Condition 3, the
subjects rated their peers on 20 rating statements, including the 10 used by Treatment Condition
2. All 20 of the rating statements are given in the appendix. It should be noted that some of the
statements are very general and person oriented in nature (e.g., 1, 2, 10) while others are more
specific and job related (e.g., 6, 15, [illegible]). This design provided 45 subjects on whom 20 rated
statements were available, 89 on whom there were 10 rated statements, and 132 who had all
rated the same five statements. Summary statistics for all nine seminar groups and three treatment
conditions are available in Table 1.

Table 1. Number of Profile Identifications (Hits)
by Treatment and by Seminar Group

Treatment             Seminar Group    N     Total Hits   Mean Hits   SD Hits

1 (5 Statements)      [illegible]      15    31           2.07        1.65
                      F                15    36           2.40        [illegible].05
                      [illegible]      13    35           2.69        1.40
2 (10 Statements)     [illegible]      14    37           2.64        1.55
                      [illegible]      15    45           3.00        1.00
                      M                15    40           2.67        1.24
3 (20 Statements)     A                15    36           2.40        1.54
                      [illegible]      15    40           2.67        [illegible]
                      [illegible]      15    34           2.27        1.33

Treatment 1 total                      43    102          2.37        1.42
Treatment 2 total                      44    122          2.77        1.07
Treatment 3 total                      45    110          2.44        1.63

T-Ratios
Treatment 1 Versus 2    t = .210*
Treatment 1 Versus 3    t = .163*
Treatment 2 Versus 3    t = .246*

*Not significant.

After  the  ratings  had  all  been  collected,  rating  scores  were  averaged  across  raters,  and  profiles 
were  constructed,  one  for  each  ratee  (see  Appendix  A for  examples).  The  ratee’s  name  was  left 
off  the  profile,  but  the  profile  itself  was  reproduced  in  sufficient  copies  that  all  members  of  the 
group  could  have  all  the  profiles  of  all  their  seminar  group  peers,  unidentified  as  to  name.  In  a 
second  visit  to  the  seminar  groups,  the  subjects  were  given  three  more  tasks  to  perform,  in  the 
following  order.  The  first  task  was  to  study  each  of  the  profiles  and  rank  the  profiles  according 
to how “a person” with that profile should do in the class the subjects were taking. The second
task was to match each of their peers with one of the profiles (that is, indicate to whom each
profile belonged). Third, each subject was given a list of the people in the seminar group and was
told  to  rank  them  according  to  how  well  they  would  do  in  the  course. 
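
The profile construction step, averaging each ratee's scores across raters, might look like the following sketch. It assumes the letter choices on the form are coded numerically (for example A = 1 through E = 5), which the report does not state; the function name and data are hypothetical.

```python
def build_profiles(ratings):
    """Average each ratee's scores across raters, statement by statement.

    ratings: iterable of (rater, ratee, scores), where scores is a list holding
    one numeric rating per statement. Returns {ratee: [mean per statement]}.
    """
    totals, counts = {}, {}
    for _rater, ratee, scores in ratings:
        if ratee not in totals:
            totals[ratee] = [0.0] * len(scores)
            counts[ratee] = 0
        for i, score in enumerate(scores):
            totals[ratee][i] += score
        counts[ratee] += 1
    return {ratee: [t / counts[ratee] for t in totals[ratee]] for ratee in totals}


# Hypothetical ratings on two statements.
ratings = [
    ("rater1", "peer_B", [3, 4]),
    ("rater2", "peer_B", [5, 4]),
    ("rater1", "peer_C", [2, 2]),
]
profiles = build_profiles(ratings)
print(profiles)
```

The ratee's name would then be withheld when the averaged profile is handed back to the group.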

The data were subjected to two analysis of variance treatments (Tables 2 and 3), and then
they  were  reanalyzed  using  multiple  linear  regression  analysis  (Tables  4 to  7). 




III. RESULTS AND DISCUSSION


Table 2 shows that there were no differences among treatment conditions in the number of
“hits.” Those subjects who were looking at profiles made from 20 rating statements could identify
their peers no better than those looking at profiles made from five statements. Number of hits, as
mentioned above, is a relatively insensitive measure of peer recognition, however (Table 1 shows
that each group averaged about 2.5 hits).


Table 2. Analysis of Variance of Number of Correct
Profile Identifications (Hits) by Treatment
and by Seminar Group

Source                              Sum of Squares   df     Mean Square    F

Treatment                           [illegible]      2      [illegible]    [illegible]*
Seminar Groups Within Treatment     [illegible]      6      [illegible]    .440*
Error (Within Groups)               411.717          123    2.544

*Not significant.

Table 3 shows the results of analysis of variance treatment of the squared differences
between the ranking of the unidentified profiles and the ranking of peers. Again, there is no
significant difference among groups, even using this more sensitive measure of peer identification.


Table 3. Analysis of Variance of Squared Deviations between Unidentified
Profile Rankings and Peer Rankings by Treatment and by Seminar Group

Source                              Sum of Squares   df     Mean Square    F

Treatment                           [illegible]      2      7540.475       1.24*
Seminar Groups Within Treatments    [illegible]      6      [illegible]    2.074*
Error (Within Group)                [illegible]      123    28117.414

*Not significant.


Intercorrelation matrices among the various groups of rating statements and final school grade
appear in Tables 4 to 6. The most striking aspect of these correlations is their size. The
reliabilities of the separate rating statements are unknown, but it appears that the intercorrelations
of each rating statement with the others must approach the statement reliabilities. For example, in
Table 4, 144 of the 190 intercorrelations among rating statements are .70 or higher, and 4[illegible] are
.80 or higher. All this argues rather strongly for the likelihood that little is being rated except a
general idea of excellence.
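
A tally of this kind is easy to reproduce from any intercorrelation matrix. The sketch below is illustrative only: the 4 x 4 matrix is invented, and only the pair count for 20 statements follows directly from the design.

```python
from itertools import combinations


def count_at_least(corr, threshold):
    """Count distinct statement pairs whose correlation meets the threshold.
    corr is a square, symmetric matrix given as a list of lists."""
    k = len(corr)
    return sum(1 for i, j in combinations(range(k), 2) if corr[i][j] >= threshold)


# Twenty statements yield 20 * 19 / 2 distinct pairs.
n_pairs = len(list(combinations(range(20), 2)))

# Hypothetical intercorrelations among four rating statements.
corr = [
    [1.00, 0.82, 0.73, 0.65],
    [0.82, 1.00, 0.71, 0.58],
    [0.73, 0.71, 1.00, 0.80],
    [0.65, 0.58, 0.80, 1.00],
]
high = count_at_least(corr, 0.70)
print(n_pairs, high)
```

When most pairs clear a high threshold, as in the matrices reported here, the statements are carrying largely the same information.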

The one worrisome feature of these intercorrelation matrices is the fact that there are some
sizable differences among the 20 rating statements in their validities against final school grade,
with Statement 1 exhibiting the highest relationship with the criterion. In each of the three
matrices, this finding seems to argue against the proposition that the rater can evaluate only in
general terms; otherwise, how could one statement be more valid than another? However, there are
two possible explanations for the higher validity of Statement 1.


Table 5. Intercorrelations, 10 Rating Statements, and Final School Grade
(N = 89)

Rows for Statements 1 to 3 survive only in part: .82, .84, 1.00, .80, .69, .92, .80, .87, .64

Statement     4      5      6      7      8      9      10     Grade

4             1.00   .73    .74    .82    .70    .75    .88    .41
5                    1.00   .74    .84    .68    .78    .78    .44
6                           1.00   .71    .58    .63    .78    [illegible]
7                                  1.00   .81    .86    .87    .61
8                                         1.00   .87    .81    .52
9                                                1.00   .82    .54
10                                                      1.00   .50
Grade                                                          1.00

Mean (Statements 1 to 10, Grade): 3.2, 3.1, 3.3, 3.5, 3.5, 3.6, 3.4, 3.3, 3.3, 3.5, 328.3
SD (partial): .54, .47, .43, .44, .44, .44, .56, .45, .46

Table 6. Intercorrelations, Five Rating Statements,
and Final School Grades

Surviving entries: 1.00, .80, .38, 1.00, .38, 1.00
Mean (partial): 3.2, 3.1, [...], 3.4, 324.1
SD (partial): .55, .52, [...]

The most obvious explanation is that Statement 1 (Learning Ability - acquires knowledge
accurately and quickly) describes the factor among the set of 20 which is most important for
success in this school environment. However, the learning at the NCO Academy does not appear to
be  of  the  kind  which  taxes  learning  ability,  such  as  difficult  academic  subjects  might.  Looking  in 
from  the  outside,  it  appears  that  several  others  should  be  at  least  as  important  for  success  in  this 
particular  school  (e.g.,  leadership,  quality  of  work,  motivation,  knowledge  of  duties).  Furthermore, 
if  the  raters  really  are  making  a distinction  between  learning  ability  and  the  other  statements,  it  is 
difficult  to  explain  the  high  intercorrelations  among  the  statements. 

An  alternative  explanation  is  that  the  order  of  presentation  of  the  statements  explains  the 
higher  validity  of  learning  ability  for  the  success  criterion.  This  argument  implies  that  whatever 
statement  was  presented  first  would  exhibit  the  highest  validity,  and  learning  ability  just  happened 
to be the first in the series. Assuming that the raters really cannot consider the separate
statements independently, it would be natural enough for them to rate the first factor in terms of
a global perception of the ratee’s general excellence, which should exhibit the highest validity of
which the rater is capable. When the rater is then faced with the task of rating the second and
succeeding statements, the rater perceives an implication that these other statements should be
somehow different from the first. So an implicit requirement is generated by the mechanics of the
situation to rate the second and succeeding statements differently from the first, but there is not
(by assumption) an ability to do so accurately. If this scenario is accurate, intercorrelation
matrices similar to those displayed in Tables 4, 5, and 6 should result.

There is no way within the limits of this study to determine which of these two
explanations is correct, but another study in the same context varying the order of presentation
of the statements might clarify the relationships and should be easy enough to accomplish.

The  results  of  four  regression  analyses  appear  in  Table  7.  The  first  question  to  be  addressed 
by  this  table  is  “When  five  rating  statements  are  available  on  a set  of  ratees,  is  anything  of  value 
added  by  considering  an  additional  15  rating  statements  when  one  is  predicting  some  meaningful 
criterion such as school success?”


Table 7. Regression Analyses of Varying Numbers
of Rating Statements

Problem    Rating Statements    R2      N     Difference    F

A          1-20                 .693    45
           1-5                  .605    45    .088          .458a
B          1-10                 .575    89
           1-5                  .549    89    .026          .872a
C          1, 19                .614    45
           1 alone              .566    45    .048          5.212*
D          1, 19, 9             .623    45
           1, 19                .614    45    .009          .991a

aNot significant.
*Significant at the .05 level.


Only  45  of  the  subjects  (Treatment  Condition  3)  were  available  to  investigate  this  question, 
since only 45 subjects rated their peers on all 20 rating statements. Problem A in Table 7 shows
clearly that there is no significant difference between the full model R2 (using all 20 statements)
and the restricted model R2 (using only five statements).
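
The report does not print the formula behind the F values in Table 7. A minimal sketch, assuming the conventional F ratio for comparing a full and a restricted regression model, is given below; the function and argument names are hypothetical.

```python
def f_for_r2_increase(r2_full, r2_restricted, k_full, k_restricted, n):
    """F ratio for the gain in R2 when predictors are added to a restricted model.

    Assumes the standard form
        F = ((R2_full - R2_restricted) / df1) / ((1 - R2_full) / df2)
    with df1 = k_full - k_restricted and df2 = n - k_full - 1.
    """
    df1 = k_full - k_restricted
    df2 = n - k_full - 1
    f_ratio = ((r2_full - r2_restricted) / df1) / ((1.0 - r2_full) / df2)
    return f_ratio, df1, df2


# Problem A: statements 1-20 (full) against statements 1-5 (restricted), N = 45.
f_a, df1_a, df2_a = f_for_r2_increase(0.693, 0.605, k_full=20, k_restricted=5, n=45)
print(round(f_a, 3), df1_a, df2_a)
```

Under that assumption, the Problem A values give an F of about 0.459 on 15 and 24 degrees of freedom, in line with the .458 shown in the table. With only 45 subjects and 20 predictors, the denominator degrees of freedom are small, which is one reason a difference of .088 in R2 falls well short of significance.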

The  next  question  that  arises  is,  “Are  10  statements  better  than  five  in  predicting  school 
success?”  Data  pertinent  to  this  question  were  available  from  89  subjects,  and  are  shown  in 
Problem B. Again, there is clearly no significant advantage in using 10 statements rather than five.

Finally,  it  seems  important  to  ask,  “What  is  the  smallest  subset  of  the  20  rating  statements 
which carries the predictive burden of the entire set?” Problems C and D in Table 7 address this
issue. The 45 subjects in Treatment Condition 3 were used, since these were the only subjects
who rated the entire set of 20 statements.

When rating statement 1 (Learning Ability) is the only variable in the prediction system, the
R2 is .566. Adding rating statement 19 (Knowledge of Duties) to the prediction system increases
the R2 to .614, an increase which is significant at the .05 level. When both rating statements 1
and 19 are in the prediction system, the addition of any of the other 18 statements does not
improve prediction significantly.

It is not unusual that the results of a study support a simply stated hypothesis only
partially. This study began with the hypothesis that untrained raters can rate only on some
general idea of excellence and that ratings cannot be made better by requiring the rater to rate
several separate characteristics. This hypothesis was tested, using two external criteria of
goodness of rating scores.

When an external criterion of recognition of ratee profiles is used to evaluate the goodness
of sets of ratings, the results indicate that sets of rating statements larger than five do not
provide better recognition of peers. The analysis of variance design did not permit any conclusions
concerning whether five rating statements produced significantly better recognition than one
statement.

When the external criterion is class standing and the 20 rating statements are subjected to
multiple linear regression analysis, the results indicate unequivocally that large sets of rating
statements do not provide better measurement than small sets. Apparently, however, a single rating
statement does not carry as much predictive power as two. One does not know, of course,
whether a single rating statement deliberately designed to be as broad and global as possible might
have provided all the prediction attainable from all combinations of “factor” statements, since the
20 statements studied were selected partially on the basis of their apparent independence of each
other. But that does not change the fact that the beginning hypothesis had to be rejected in
favor of an alternate one that states raters can effectively use only a very small subset of the 20
rating statements investigated in this study. Future studies will determine whether a single global
rating statement will provide all the useful information available from any number of “factor”
rating statements and will find out in several different contexts what is the most likely maximum
number of useful “factor” rating statements.

APPENDIX A: EVALUATION FORMS


EVALUATION FORM

Rating scale: (A) Below Average; (B) Average; (C) Above Average; (D) Well Above Average; (E) Outstanding

1. Learning Ability - acquires knowledge accurately and quickly
   (A) (B) (C) (D) (E)

2. Leadership - effectiveness in getting ideas accepted and in guiding others to accomplish a task
   (A) (B) (C) (D) (E)

3. Quality of Work - produces work of high quality
   (A) (B) (C) (D) (E)

4. Motivation - strong desires to accomplish goals and objectives
   (A) (B) (C) (D) (E)

5. Follows Instructions - follows directions as prescribed
   (A) (B) (C) (D) (E)

6. Bearing and Behavior - maintains professional conduct and appearance
   (A) (B) (C) (D) (E)

7. Accuracy - precision and carefulness in work performance
   (A) (B) (C) (D) (E)

8. Oral Communication - expresses ideas clearly, logically, and grammatically in conversation
   (A) (B) (C) (D) (E)

9. Problem Analysis - identifies and analyzes problems which require action
   (A) (B) (C) (D) (E)

10. Initiative - self-starting, rarely needs a push to get going
   (A) (B) (C) (D) (E)

11. Quantity of Work - accomplishes a large amount of work
   (A) (B) (C) (D) (E)

12. Written Communication - expresses ideas clearly in writing with good grammatical form
   (A) (B) (C) (D) (E)

13. Punctuality - prompt in keeping engagements
   (A) (B) (C) (D) (E)

14. Adaptability - changes attitude and behavior to meet the demands of the situation
   (A) (B) (C) (D) (E)

15. Dependability - does assigned tasks conscientiously without close supervision
   (A) (B) (C) (D) (E)

16. Emotional Stability - stability and calmness under pressure and opposition
   (A) (B) (C) (D) (E)

17. Human Relations - gets along well with fellow workers and works effectively with them
   (A) (B) (C) (D) (E)

18. Judgment - makes good decisions among competing alternatives
   (A) (B) (C) (D) (E)

19. Knowledge of Duties - understands the requirements for effective work performance
   (A) (B) (C) (D) (E)

20. Honesty - straightforward and truthful in dealing with others
   (A) (B) (C) (D) (E)

Evaluation Form

[Sample profile: averaged ratings marked on a graphic scale for each statement (1. Learning Ability, 2. Leadership, 3. Quality of Work, 4. Motivation, 5. Follows Instructions). The graphic is not recoverable.]